R² Score (r2_score)#
The coefficient of determination (R^2) is a regression metric that answers:
“How much better are my predictions than the simplest baseline that always predicts the mean?”
It’s scale-free and easy to interpret on a single dataset, but it has some gotchas: it can go negative, it is undefined for constant targets, and the “variance explained” reading only holds under specific conditions.
Learning goals#
By the end you should be able to:
derive and interpret (R^2)
understand why (R^2\in(-\infty, 1]) and what (R^2=0) means
implement r2_score from scratch in NumPy (including weights + multioutput)
connect maximizing (R^2) to minimizing squared error
track (R^2) while fitting linear regression with gradient descent
Quick import#
from sklearn.metrics import r2_score
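A toy call to see the shape of the API (numbers are hand-checkable: the mean of y_true is 2.875, SST ≈ 29.19, SSE = 1.5):
r2_score([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])  # ≈ 0.9486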
Prerequisites#
mean and variance
squared error
basic regression notation
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import r2_score as sk_r2_score
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Definition: compare to the mean baseline#
Given true targets (y_1,\dots,y_n) and predictions (\hat{y}_1,\dots,\hat{y}_n):
residuals: (e_i = y_i - \hat{y}_i)
baseline prediction: the mean (\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i)
Define the sum of squared errors and the total sum of squares:
[ \text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2,\qquad \text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2 ]
Then the (R^2) score is:
[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}} ]
Interpretation:
(R^2 = 1): perfect predictions ((\text{SSE}=0))
(R^2 = 0): no improvement over predicting the mean ((\text{SSE}=\text{SST}))
(R^2 < 0): worse than predicting the mean ((\text{SSE}>\text{SST}))
So (R^2) is a relative-to-baseline score, not an “absolute error” measure.
# A small synthetic example: four predictors on the same y_true
n = 60
x = np.linspace(0, 10, n)
y_true = 1.5 + 0.8 * x + rng.normal(0, 1.2, size=n)
mean_pred = np.full_like(y_true, y_true.mean())
models = {
"Perfect": y_true,
"Good (small noise)": y_true + rng.normal(0, 0.5, size=n),
"Mean baseline": mean_pred,
"Worse-than-mean (anti)": 2 * y_true.mean() - y_true, # guarantees R² = -3
}
sst = np.sum((y_true - y_true.mean()) ** 2)
rows = []
for name, y_pred in models.items():
sse = np.sum((y_true - y_pred) ** 2)
rows.append((name, sse, 1 - sse / sst, sk_r2_score(y_true, y_pred)))
print("SST (baseline SSE) =", sst)
print("\nModel comparison")
print("-" * 70)
for name, sse, r2_manual, r2_sklearn in rows:
print(f"{name:24s} SSE={sse:8.2f} R²(manual)={r2_manual:7.3f} R²(sklearn)={r2_sklearn:7.3f}")
# Bar view: R² is a rescaling of SSE (on a fixed dataset)
names = [r[0] for r in rows]
sse_vals = np.array([r[1] for r in rows], dtype=float)
r2_vals = np.array([r[3] for r in rows], dtype=float)
sse_over_sst = sse_vals / sst
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Relative error: SSE / SST", "R² = 1 - SSE/SST"),
)
fig.add_trace(go.Bar(x=names, y=sse_over_sst, name="SSE/SST", marker_color="#E45756"), row=1, col=1)
fig.add_hline(
y=1.0,
line_dash="dash",
line_color="black",
row=1,
col=1,
annotation_text="mean baseline",
annotation_position="top left",
)
fig.add_trace(go.Bar(x=names, y=r2_vals, name="R²", marker_color="#4C78A8"), row=1, col=2)
fig.add_hline(
y=0.0,
line_dash="dash",
line_color="black",
row=1,
col=2,
annotation_text="mean baseline",
annotation_position="top left",
)
fig.update_yaxes(title_text="SSE / SST", row=1, col=1)
fig.update_yaxes(title_text="R²", row=1, col=2)
fig.update_xaxes(tickangle=-20, row=1, col=1)
fig.update_xaxes(tickangle=-20, row=1, col=2)
fig.update_layout(title="Same ordering, different scale", height=420, showlegend=False)
fig.show()
SST (baseline SSE) = 403.6001913242214
Model comparison
----------------------------------------------------------------------
Perfect SSE= 0.00 R²(manual)= 1.000 R²(sklearn)= 1.000
Good (small noise) SSE= 9.27 R²(manual)= 0.977 R²(sklearn)= 0.977
Mean baseline SSE= 403.60 R²(manual)= 0.000 R²(sklearn)= 0.000
Worse-than-mean (anti) SSE= 1614.40 R²(manual)= -3.000 R²(sklearn)= -3.000
# Visual intuition: y_true vs y_pred (each panel shows a different model)
subplot_titles = [
f"{name}<br>R² = {sk_r2_score(y_true, y_pred):.3f}" for name, y_pred in models.items()
]
fig = make_subplots(rows=2, cols=2, subplot_titles=subplot_titles)
mn = float(np.min(y_true))
mx = float(np.max(y_true))
for k, (name, y_pred) in enumerate(models.items()):
row = 1 if k < 2 else 2
col = 1 if (k % 2) == 0 else 2
fig.add_trace(
go.Scatter(x=y_true, y=y_pred, mode="markers", name=name, showlegend=False),
row=row,
col=col,
)
fig.add_trace(
go.Scatter(
x=[mn, mx],
y=[mn, mx],
mode="lines",
line=dict(color="black", dash="dash"),
showlegend=False,
),
row=row,
col=col,
)
fig.update_xaxes(title_text="y_true", row=row, col=col)
fig.update_yaxes(title_text="y_pred", row=row, col=col)
fig.update_layout(
title="R² compares SSE to the mean baseline (R² can be negative)",
height=650,
)
fig.show()
Why the mean baseline makes sense#
The baseline prediction (\hat{y}_i = \bar{y}) minimizes squared error among all constant predictors.
Any constant (c) gives (\sum_i (y_i - c)^2).
The derivative w.r.t. (c) is (-2\sum_i(y_i-c)), which is 0 at (c=\bar{y}).
So (\text{SST}) is literally “the best SSE you can do without using features”.
That’s why (R^2=0) means “no better than doing nothing smarter than predicting the mean”.
# Visualize why the mean is the best constant predictor
c_grid = np.linspace(float(y_true.min()) - 2.0, float(y_true.max()) + 2.0, 200)
sse_grid = np.array([np.sum((y_true - c) ** 2) for c in c_grid])
fig = go.Figure()
fig.add_trace(go.Scatter(x=c_grid, y=sse_grid, mode="lines", name="SSE(c)"))
fig.add_vline(
x=float(y_true.mean()),
line_dash="dash",
line_color="black",
annotation_text="mean(y)",
annotation_position="top",
)
fig.update_layout(
title="Constant predictor: SSE is minimized at the mean",
xaxis_title="constant prediction c",
yaxis_title="SSE",
height=350,
)
fig.show()
2) Sums of squares and “variance explained”#
A common phrase is that (R^2) is the “fraction of variance explained”. That’s true in a specific (important) setting:
you fit ordinary least squares (OLS) with an intercept.
In that case, define the explained sum of squares:
[ \text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 ]
Then we get the classic decomposition:
[ \text{SST} = \text{SSR} + \text{SSE} ]
and so
[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}}. ]
For arbitrary predictions (not OLS fitted values), the decomposition (\text{SST}=\text{SSR}+\text{SSE}) need not hold, but (R^2 = 1-\text{SSE}/\text{SST}) is still the standard definition used by sklearn.metrics.r2_score.
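A quick synthetic illustration of that caveat, using deliberately biased predictions (names like y_demo are just for this check):
# Biased (non-OLS) predictions break SST = SSR + SSE, but R² is still defined
y_demo = rng.normal(size=30)
y_pred_biased = y_demo + 0.5  # systematic shift: not least-squares fitted values
sst_d = np.sum((y_demo - y_demo.mean()) ** 2)
sse_d = np.sum((y_demo - y_pred_biased) ** 2)
ssr_d = np.sum((y_pred_biased - y_demo.mean()) ** 2)
print(f"SST={sst_d:.3f}  SSR+SSE={ssr_d + sse_d:.3f}  (decomposition fails)")
print("R² =", sk_r2_score(y_demo, y_pred_biased))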
A useful rewrite (same dataset)#
If you define
(\text{MSE} = \frac{1}{n}\sum_i (y_i-\hat{y}_i)^2)
(\mathrm{Var}(y) = \frac{1}{n}\sum_i (y_i-\bar{y})^2) (population variance, with (n) not (n-1))
then
[ R^2 = 1 - \frac{\text{MSE}}{\mathrm{Var}(y)}. ]
This highlights something important: on a fixed dataset, maximizing (R^2) is equivalent to minimizing MSE/SSE.
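A quick numeric check of the identity (note that np.var defaults to the population variance, ddof=0, matching the definition above):
# R² = 1 - MSE / Var(y) on the same data
y_chk = rng.normal(size=40)
y_hat_chk = y_chk + rng.normal(0, 0.3, size=40)
mse_chk = np.mean((y_chk - y_hat_chk) ** 2)
print(1 - mse_chk / np.var(y_chk))    # np.var divides by n (ddof=0)
print(sk_r2_score(y_chk, y_hat_chk))  # identical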
# Show SST = SSR + SSE for an OLS fit with an intercept
n = 120
x = rng.uniform(-3, 3, size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=n)
X = np.column_stack([np.ones(n), x]) # intercept + feature
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_ols
y_bar = y.mean()
SST = np.sum((y - y_bar) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = np.sum((y_hat - y_bar) ** 2)
print("w_ols =", w_ols)
print(f"SST={SST:.3f} SSR={SSR:.3f} SSE={SSE:.3f} (SST-(SSR+SSE))={SST-(SSR+SSE):.3e}")
print(f"R² (manual) = {1 - SSE / SST:.4f}")
print(f"R² (sklearn) = {sk_r2_score(y, y_hat):.4f}")
w_ols = [2.0415 1.5628]
SST=1093.626 SSR=964.266 SSE=129.360 (SST-(SSR+SSE))=4.547e-13
R² (manual) = 0.8817
R² (sklearn) = 0.8817
# Variance decomposition as a stacked bar
fig = go.Figure()
fig.add_trace(go.Bar(name="Explained (SSR)", x=["SST"], y=[SSR], marker_color="#4C78A8"))
fig.add_trace(go.Bar(name="Residual (SSE)", x=["SST"], y=[SSE], marker_color="#E45756"))
fig.update_layout(
barmode="stack",
title=f"OLS with intercept: SST = SSR + SSE (R² = {1 - SSE / SST:.3f})",
yaxis_title="Sum of squares",
height=350,
)
fig.show()
3) From-scratch NumPy implementation#
Weighted + multioutput definitions#
For sample weights (w_i \ge 0), define the weighted mean:
[ \bar{y}_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i} ]
Then (per output):
[ \text{SSE}_w = \sum_{i=1}^n w_i (y_i - \hat{y}_i)^2,\qquad \text{SST}_w = \sum_{i=1}^n w_i (y_i - \bar{y}_w)^2,\qquad R^2_w = 1 - \frac{\text{SSE}_w}{\text{SST}_w}. ]
For multioutput (y \in \mathbb{R}^{n\times m}), compute (R^2) per output and then aggregate.
Below is a small NumPy implementation that mirrors the core ideas in scikit-learn:
supports 1D targets ((n,)) and multioutput targets ((n, m))
supports sample_weight
supports multioutput aggregation: 'raw_values' (per-output scores), 'uniform_average' (simple mean), 'variance_weighted' (weights each output by its (\text{SST}))
handles constant targets with force_finite=True (sklearn's default): perfect predictions (\Rightarrow 1.0), imperfect predictions (\Rightarrow 0.0)
Note: with fewer than 2 samples, (R^2) is undefined; scikit-learn returns nan.
def r2_score_numpy(
y_true,
y_pred,
*,
sample_weight=None,
multioutput="uniform_average",
force_finite=True,
):
"""NumPy implementation of the R² score.
Parameters
----------
y_true, y_pred : array-like
Shape (n_samples,) or (n_samples, n_outputs)
sample_weight : array-like, optional
Shape (n_samples,)
multioutput : {'raw_values', 'uniform_average', 'variance_weighted'} or array-like
force_finite : bool
If True (default), replace non-finite scores for constant targets:
- perfect predictions -> 1.0
- imperfect predictions -> 0.0
Returns
-------
score : float or ndarray
"""
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError(f"y_true and y_pred must have the same shape, got {y_true.shape} vs {y_pred.shape}")
if y_true.ndim == 1:
y_true_2d = y_true.reshape(-1, 1)
y_pred_2d = y_pred.reshape(-1, 1)
elif y_true.ndim == 2:
y_true_2d = y_true
y_pred_2d = y_pred
else:
raise ValueError(f"Expected 1D or 2D arrays, got ndim={y_true.ndim}")
n_samples, n_outputs = y_true_2d.shape
if n_samples < 2:
# R² is undefined with fewer than 2 samples (sklearn returns nan + a warning)
return np.nan
if sample_weight is None:
w = np.ones((n_samples, 1), dtype=float)
else:
w = np.asarray(sample_weight, dtype=float).reshape(-1, 1)
if w.shape[0] != n_samples:
raise ValueError(f"sample_weight must have shape (n_samples,), got {w.shape}")
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative")
w_sum = np.sum(w)
if w_sum == 0:
raise ValueError("Sum of sample_weight must be > 0")
# Weighted mean per output
y_bar = np.sum(w * y_true_2d, axis=0) / w_sum
numerator = np.sum(w * (y_true_2d - y_pred_2d) ** 2, axis=0) # SSE per output
denominator = np.sum(w * (y_true_2d - y_bar) ** 2, axis=0) # SST per output
with np.errstate(divide="ignore", invalid="ignore"):
output_scores = 1.0 - (numerator / denominator)
if force_finite:
# Constant targets -> denominator == 0
mask = denominator == 0
if np.any(mask):
# Perfect prediction => numerator == 0 -> score 1.0
perfect = mask & (numerator == 0)
output_scores = output_scores.copy()
output_scores[perfect] = 1.0
output_scores[mask & ~perfect] = 0.0
# Multioutput aggregation
if multioutput == "raw_values":
scores = output_scores
elif multioutput == "uniform_average":
scores = float(np.mean(output_scores))
elif multioutput == "variance_weighted":
# Weight by SST (denominator)
denom_sum = float(np.sum(denominator))
scores = float(np.sum(output_scores * denominator) / denom_sum) if denom_sum > 0 else float(np.mean(output_scores))
else:
# array-like weights per output
weights = np.asarray(multioutput, dtype=float)
if weights.shape != (n_outputs,):
raise ValueError(f"multioutput weights must have shape (n_outputs,), got {weights.shape}")
if np.any(weights < 0):
raise ValueError("multioutput weights must be non-negative")
w_out_sum = float(np.sum(weights))
if w_out_sum == 0:
raise ValueError("Sum of multioutput weights must be > 0")
scores = float(np.sum(output_scores * weights) / w_out_sum)
if n_outputs == 1 and multioutput != "raw_values":
return float(scores)
return scores
# Quick checks vs scikit-learn
# 1D
y_true = rng.normal(size=50)
y_pred = y_true + rng.normal(0, 0.3, size=50)
print("1D")
print("numpy :", r2_score_numpy(y_true, y_pred))
print("sklearn:", sk_r2_score(y_true, y_pred))
# Multioutput
Y_true = rng.normal(size=(80, 3))
Y_pred = Y_true + rng.normal(0, 0.5, size=(80, 3))
w = rng.uniform(0.5, 2.0, size=80)
print("\nMultioutput (raw)")
print("numpy :", r2_score_numpy(Y_true, Y_pred, sample_weight=w, multioutput="raw_values"))
print("sklearn:", sk_r2_score(Y_true, Y_pred, sample_weight=w, multioutput="raw_values"))
print("\nMultioutput (uniform_average)")
print("numpy :", r2_score_numpy(Y_true, Y_pred, sample_weight=w, multioutput="uniform_average"))
print("sklearn:", sk_r2_score(Y_true, Y_pred, sample_weight=w, multioutput="uniform_average"))
print("\nMultioutput (variance_weighted)")
print("numpy :", r2_score_numpy(Y_true, Y_pred, sample_weight=w, multioutput="variance_weighted"))
print("sklearn:", sk_r2_score(Y_true, Y_pred, sample_weight=w, multioutput="variance_weighted"))
# Constant target edge case
print("\nConstant y_true")
y_const = np.ones(5)
print("perfect (force_finite=True):", r2_score_numpy(y_const, np.ones(5), force_finite=True))
print("bad (force_finite=True):", r2_score_numpy(y_const, np.zeros(5), force_finite=True))
print("perfect (force_finite=False):", r2_score_numpy(y_const, np.ones(5), force_finite=False))
print("bad (force_finite=False):", r2_score_numpy(y_const, np.zeros(5), force_finite=False))
1D
numpy : 0.8907791234783493
sklearn: 0.8907791234783493
Multioutput (raw)
numpy : [0.7509 0.7757 0.7692]
sklearn: [0.7509 0.7757 0.7692]
Multioutput (uniform_average)
numpy : 0.7652668866547568
sklearn: 0.7652668866547568
Multioutput (variance_weighted)
numpy : 0.764809940184828
sklearn: 0.764809940184828
Constant y_true
perfect (force_finite=True): 1.0
bad (force_finite=True): 0.0
perfect (force_finite=False): nan
bad (force_finite=False): -inf
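The note above said (R^2) is undefined with fewer than 2 samples; a quick poke at our implementation confirms it (sklearn behaves the same, with a warning):
# Fewer than 2 samples: undefined, so r2_score_numpy returns nan
print(r2_score_numpy([1.0], [0.9]))  # nan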
4) Using R² while optimizing a model (from scratch)#
Because (\text{SST}) depends only on (y), it is constant w.r.t. model parameters (\theta) on a fixed dataset.
[ R^2(\theta) = 1 - \frac{\text{SSE}(\theta)}{\text{SST}} ]
So:
maximizing (R^2) (\Leftrightarrow) minimizing (\text{SSE}) (or MSE)
their gradients differ only by a constant scale:
[ \nabla_\theta R^2(\theta) = -\frac{1}{\text{SST}}\nabla_\theta \text{SSE}(\theta) ]
In practice we optimize MSE/SSE (a proper loss), and use (R^2) as a training/validation score.
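A small finite-difference sanity check of that gradient relation on a toy linear model (names like x_c, w_c are just for this check):
# Finite-difference check: ∇θ R² equals -(1/SST) · ∇θ SSE
x_c = rng.uniform(-1, 1, size=30)
y_c = 0.5 + 2.0 * x_c + rng.normal(0, 0.2, size=30)
X_c = np.column_stack([np.ones(30), x_c])
w_c = np.array([0.1, 0.3])  # arbitrary parameter point
sst_c = np.sum((y_c - y_c.mean()) ** 2)

def r2_of(w):
    return 1 - np.sum((y_c - X_c @ w) ** 2) / sst_c

eps = 1e-6
fd_grad = np.array([(r2_of(w_c + eps * e) - r2_of(w_c - eps * e)) / (2 * eps) for e in np.eye(2)])
sse_grad = -2 * X_c.T @ (y_c - X_c @ w_c)  # analytic ∇SSE
print(fd_grad)
print(-sse_grad / sst_c)  # matches up to finite-difference error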
Example: linear regression with gradient descent#
We’ll fit a line (\hat{y} = w_0 + w_1 x) by minimizing MSE, and track (R^2) over iterations.
# Data: y = w0 + w1 x + noise
n = 200
x = rng.uniform(-2, 4, size=n)
y = 1.0 + 2.5 * x + rng.normal(0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
# Closed-form (for reference)
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
# Gradient descent
w = np.zeros(2)
lr = 0.05
n_steps = 300
r2_hist = []
mse_hist = []
w_hist = []
for _ in range(n_steps):
y_hat = X @ w
err = y_hat - y
mse = float(np.mean(err**2))
grad = (2.0 / n) * (X.T @ err)
r2_hist.append(r2_score_numpy(y, y_hat))
mse_hist.append(mse)
w_hist.append(w.copy())
w = w - lr * grad
w_hist = np.asarray(w_hist)
print("Closed-form w:", w_closed)
print("GD final w :", w)
print("Final R² :", r2_hist[-1])
Closed-form w: [0.9932 2.4779]
GD final w : [0.9932 2.4779]
Final R² : 0.9430322406480296
# Optimization diagnostics
iters = np.arange(n_steps)
fig = make_subplots(rows=1, cols=2, subplot_titles=("R² vs iteration", "MSE vs iteration"))
fig.add_trace(go.Scatter(x=iters, y=r2_hist, mode="lines", name="R²"), row=1, col=1)
fig.update_yaxes(title_text="R²", row=1, col=1)
fig.update_xaxes(title_text="iteration", row=1, col=1)
fig.add_trace(go.Scatter(x=iters, y=mse_hist, mode="lines", name="MSE", line=dict(color="#E45756")), row=1, col=2)
fig.update_yaxes(title_text="MSE", row=1, col=2)
fig.update_xaxes(title_text="iteration", row=1, col=2)
fig.update_layout(height=350, title="Gradient descent improves MSE and (equivalently) R²")
fig.show()
# Fit visualization: data + mean baseline + fitted line
x_line = np.linspace(x.min(), x.max(), 200)
X_line = np.column_stack([np.ones_like(x_line), x_line])
y_line_gd = X_line @ w
y_line_closed = X_line @ w_closed
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name="data", marker=dict(size=6, opacity=0.6)))
fig.add_trace(go.Scatter(x=x_line, y=np.full_like(x_line, y.mean()), mode="lines", name="mean baseline", line=dict(dash="dash", color="gray")))
fig.add_trace(go.Scatter(x=x_line, y=y_line_closed, mode="lines", name="closed-form", line=dict(color="#4C78A8", width=3)))
fig.add_trace(go.Scatter(x=x_line, y=y_line_gd, mode="lines", name="gradient descent", line=dict(color="#F58518", width=3)))
fig.update_layout(
title=f"Linear regression fit (R² ≈ {r2_score_numpy(y, X @ w):.3f})",
xaxis_title="x",
yaxis_title="y",
height=450,
)
fig.show()
5) Pros, cons, and when to use#
Pros#
Baseline-relative: interpretable against a meaningful reference (predicting (\bar{y}))
Scale-free: shifting/scaling (y) (and predictions accordingly) does not change (R^2); see the quick check after this list
Widely used: the default .score() for many sklearn regressors
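A quick check of the scale-free claim, applying an arbitrary affine rescaling to both the targets and the predictions:
# Rescaling y and y_hat together leaves R² unchanged
y_s = rng.normal(size=50)
y_hat_s = y_s + rng.normal(0, 0.4, size=50)
a, b = 3.7, -12.0  # arbitrary scale and shift
print(sk_r2_score(y_s, y_hat_s))
print(sk_r2_score(a * y_s + b, a * y_hat_s + b))  # same value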
Cons / pitfalls#
Can be negative: a model can be arbitrarily worse than the mean baseline
Not defined for constant targets: if (\text{SST}=0), sklearn uses force_finite=True to map (nan, -inf) to (1.0, 0.0)
“Variance explained” is conditional: the (\text{SST}=\text{SSR}+\text{SSE}) story holds for OLS with an intercept, not arbitrary predictions
Can reward overfitting: on training data, (R^2) almost always increases as you add features (even useless ones); see the sketch after this list
Outlier-sensitive: uses squared error, so large residuals dominate
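A sketch of the overfitting pitfall: appending pure-noise columns to an OLS design matrix never decreases training (R^2) (synthetic data; the junk-feature counts are arbitrary):
# Training R² creeps upward as useless random features are added
n_o = 100
x_o = rng.normal(size=n_o)
y_o = 1.0 + 2.0 * x_o + rng.normal(0, 1.0, size=n_o)
X_o = np.column_stack([np.ones(n_o), x_o])
for n_junk in [0, 5, 20, 50]:
    X_aug = np.column_stack([X_o, rng.normal(size=(n_o, n_junk))]) if n_junk else X_o
    w_fit, *_ = np.linalg.lstsq(X_aug, y_o, rcond=None)
    print(f"{n_junk:2d} junk features -> training R² = {sk_r2_score(y_o, X_aug @ w_fit):.4f}")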
When it’s a good choice#
comparing regression models on the same target and same dataset split
reporting an intuitive “better than mean?” summary alongside an absolute metric (MAE/RMSE)
model selection with cross-validated (R^2) rather than training (R^2)
When to be careful#
comparing across datasets with very different target variance
heavy-tailed noise / many outliers (consider MAE, Huber loss, or robust metrics)
non-stationary time series (always evaluate out-of-sample; consider rolling/forward validation)
6) Exercises#
Show that predicting (\bar{y}) is the best constant predictor by plotting SSE as a function of constant (c).
Create a case where (R^2) is strongly negative and explain it in terms of SSE vs SST.
Implement adjusted (R^2):
[ \bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-d-1} ]
and compare it to (R^2) while adding random (useless) features (a starter sketch follows this list).
Compute cross-validated (R^2) for a linear model vs a constant baseline.
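A minimal starter sketch for exercise 3 (here d is the number of features, excluding the intercept):
def adjusted_r2(r2, n, d):
    """Adjusted R²: penalizes R² for the number of features d (n = sample count)."""
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)

print(adjusted_r2(0.90, n=100, d=10))  # < 0.90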
References#
scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
Wikipedia: https://en.wikipedia.org/wiki/Coefficient_of_determination